Case Study 2

## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

Data import

Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    870 obs. of  36 variables:
 $ ID                      : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Age                     : num  32 40 35 32 24 27 41 37 34 34 ...
 $ Attrition               : chr  "No" "No" "No" "No" ...
 $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
 $ DailyRate               : num  117 1308 200 801 567 ...
 $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Sales" ...
 $ DistanceFromHome        : num  13 14 18 1 2 10 5 10 10 10 ...
 $ Education               : num  4 3 2 4 1 2 5 4 4 4 ...
 $ EducationField          : chr  "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
 $ EmployeeCount           : num  1 1 1 1 1 1 1 1 1 1 ...
 $ EmployeeNumber          : num  859 1128 1412 2016 1646 ...
 $ EnvironmentSatisfaction : num  2 3 3 3 1 4 2 4 3 4 ...
 $ Gender                  : chr  "Male" "Male" "Male" "Female" ...
 $ HourlyRate              : num  73 44 60 48 32 32 90 88 87 92 ...
 $ JobInvolvement          : num  3 2 3 3 3 3 4 2 3 2 ...
 $ JobLevel                : num  2 5 3 3 1 3 1 2 1 2 ...
 $ JobRole                 : chr  "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
 $ JobSatisfaction         : num  4 3 4 4 4 1 3 4 3 3 ...
 $ MaritalStatus           : chr  "Divorced" "Single" "Single" "Married" ...
 $ MonthlyIncome           : num  4403 19626 9362 10422 3760 ...
 $ MonthlyRate             : num  9250 17544 19944 24032 17218 ...
 $ NumCompaniesWorked      : num  2 1 2 1 1 1 2 2 1 1 ...
 $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
 $ OverTime                : chr  "No" "No" "No" "No" ...
 $ PercentSalaryHike       : num  11 14 11 19 13 21 12 14 19 14 ...
 $ PerformanceRating       : num  3 3 3 3 3 4 3 3 3 3 ...
 $ RelationshipSatisfaction: num  3 1 3 3 3 3 1 3 4 2 ...
 $ StandardHours           : num  80 80 80 80 80 80 80 80 80 80 ...
 $ StockOptionLevel        : num  1 0 0 2 0 2 0 3 1 1 ...
 $ TotalWorkingYears       : num  8 21 10 14 6 9 7 8 1 8 ...
 $ TrainingTimesLastYear   : num  3 2 2 3 2 4 5 5 2 3 ...
 $ WorkLifeBalance         : num  2 4 3 3 3 2 2 3 3 2 ...
 $ YearsAtCompany          : num  5 20 2 14 6 9 4 1 1 8 ...
 $ YearsInCurrentRole      : num  2 7 2 10 3 7 2 0 1 2 ...
 $ YearsSinceLastPromotion : num  0 4 2 5 1 1 0 0 0 7 ...
 $ YearsWithCurrManager    : num  3 9 2 7 3 7 3 0 0 7 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   Age = col_double(),
  ..   Attrition = col_character(),
  ..   BusinessTravel = col_character(),
  ..   DailyRate = col_double(),
  ..   Department = col_character(),
  ..   DistanceFromHome = col_double(),
  ..   Education = col_double(),
  ..   EducationField = col_character(),
  ..   EmployeeCount = col_double(),
  ..   EmployeeNumber = col_double(),
  ..   EnvironmentSatisfaction = col_double(),
  ..   Gender = col_character(),
  ..   HourlyRate = col_double(),
  ..   JobInvolvement = col_double(),
  ..   JobLevel = col_double(),
  ..   JobRole = col_character(),
  ..   JobSatisfaction = col_double(),
  ..   MaritalStatus = col_character(),
  ..   MonthlyIncome = col_double(),
  ..   MonthlyRate = col_double(),
  ..   NumCompaniesWorked = col_double(),
  ..   Over18 = col_character(),
  ..   OverTime = col_character(),
  ..   PercentSalaryHike = col_double(),
  ..   PerformanceRating = col_double(),
  ..   RelationshipSatisfaction = col_double(),
  ..   StandardHours = col_double(),
  ..   StockOptionLevel = col_double(),
  ..   TotalWorkingYears = col_double(),
  ..   TrainingTimesLastYear = col_double(),
  ..   WorkLifeBalance = col_double(),
  ..   YearsAtCompany = col_double(),
  ..   YearsInCurrentRole = col_double(),
  ..   YearsSinceLastPromotion = col_double(),
  ..   YearsWithCurrManager = col_double()
  .. )

The data set has 870 observations and 36 variables for us to work with.

Pre-Processing

Here I am doing some processing of the data. I am going to convert some character columns to factors to make modeling easier. I am also going to add variables to this step as I proceed with my analysis

colsToFactor <- c("Attrition", "BusinessTravel", "Department", "EducationField", 
    "Gender", "JobRole", "MaritalStatus", "Over18", "OverTime", "StockOptionLevel", 
    "JobLevel", "JobInvolvement", "Education", "EnvironmentSatisfaction", "JobSatisfaction", 
    "RelationshipSatisfaction", "WorkLifeBalance", "PerformanceRating")
# Consider 'StockOptionLevel','JobLevel','JobInvolvement','Education'; They
# could be coded as numeric
casedata[, colsToFactor] <- lapply(casedata[, colsToFactor], as.factor)

casedata$logMonthlyIncome <- log(casedata$MonthlyIncome)

casedata$IncomeLt4000 <- ifelse(casedata$MonthlyIncome <= 4000, 1, 0)
casedata$DistHomeFactor <- cut(casedata$DistanceFromHome, c(0, 10, 20, 30), 
    labels = c("Close", "Medium", "Far"), include.lowest = TRUE)
casedata$AgeGroup <- cut(casedata$Age, c(18, 25, 35, 45, 60), labels = c("18-25", 
    "25-35", "35-45", "45-60"), include.lowest = TRUE)
casedata$NumCompCat <- cut(casedata$NumCompaniesWorked, c(0, 2, 6, 9), labels = c("0-2", 
    "2-6", "6-9"), include.lowest = TRUE)
casedata$WorkingYearsGroup <- cut(casedata$TotalWorkingYears, c(0, 5, 10, 15, 
    20, 40), labels = c("0-5", "5-10", "10-15", "15-20", "20-40"), include.lowest = TRUE)
casedata$RoleYearsGroup <- cut(casedata$YearsInCurrentRole, c(0, 3, 6, 10, 20), 
    labels = c("0-3", "3-6", "6-10", "10+"), include.lowest = TRUE)
casedata$CompanyYearsGroup <- cut(casedata$YearsAtCompany, c(0, 3, 10, 20, 40), 
    labels = c("0-3", "3-10", "10-20", "20-40"), include.lowest = TRUE)
casedata$IncomeGroup <- cut(casedata$MonthlyIncome, c(0, 4000, 8000, 12000, 
    16000, 20000), labels = c("<$4K", "$4K - $8K", "$8K-$12K", "$12K-$16K", 
    "$16K-$20K"), include.lowest = TRUE)

vars <- c("MonthlyIncome", "Age", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", 
    "YearsWithCurrManager", "YearsInCurrentRole", "YearsSinceLastPromotion")

# This is an issue because of zeroes
casedata <- casedata %>% mutate_at(vars, list(log = log))

Exploratory Analysis

In this section I am going to look mostly at individual variables to try to get a sense of what each variable means and whether or not I should consider it for further analysis and modeling.

Attrition

Attrition Count Proportion
No 730 83.9%
Yes 140 16.1%

Out of the 870 employees, 140 left their jobs, which is 16% attrition

Business Travel

BusinessTravel Count Proportion
Non-Travel 94 10.8%
Travel_Frequently 158 18.2%
Travel_Rarely 618 71.0%

681 employees or 71% travel rarely. It seems like the most frequent travelers have the highest attrition rates, and non-travelers have the lowest. Non-travelers have a lower 75th percentile on income, Frequent and non-frequent travelers have similar pay.

Department

Department Count Proportion
Human Resources 35 4.02%
Research & Development 562 64.6%
Sales 273 31.4%

The sales department has the highest attrition rate and R&D has the lowest attrition rate. Mean income is fairly similar but their is a clear difference in medians with HR having the lowest median pay and Sales having the highest median pay.

Education

Education Count Proportion
1 98 11.3%
2 182 20.9%
3 324 37.2%
4 240 27.6%
5 26 2.99%

Assuming higher values of Education mean more advance education, better educated employees seem to have lower attrtion rates. 1-3 are similar but 4 & 5 have a clear decrease. Education 5 also has a very clear pay advantage.

EducationField Count Proportion
Human Resources 15 1.72%
Life Sciences 358 41.1%
Marketing 100 11.5%
Medical 270 31.0%
Other 52 5.98%
Technical Degree 75 8.62%

Those with education in HR have the highest attrition rates in this sample. Education in Human Resources seems to come with lower median pay, but those educated in marketing have higher median pay

Environment Satisfaction

EnvironmentSatisfaction Count Proportion
1 172 19.8%
2 178 20.5%
3 258 29.7%
4 262 30.1%

Assuming lower is less satisfied, those least satisfied with their enviroment have the highest attrition rates. Mean and Median pay is similar between groups here, but the 75th percentile is lower for groups 2&4. Not sure how this is meaningful to pay overall.

Gender

Gender Count Proportion
Female 354 40.7%
Male 516 59.3%

Attrition rates are similar between genders, with Men having just a slightly higher rate. In terms of pay, Men are making a little less than average in this sample.

Job Involvement

JobInvolvement Count Proportion
1 47 5.40%
2 228 26.2%
3 514 59.1%
4 81 9.31%

Those with lower job involvement have much higher attrition. The relationship with pay is much less clear.

Job Level

JobLevel Count Proportion
1 329 37.8%
2 312 35.9%
3 132 15.2%
4 60 6.90%
5 37 4.25%

Job level 1 has a much higher attrition rate. Job level also seems very linearly associated with pay.

Job Role

JobRole Count Proportion
Healthcare Representative 76 8.74%
Human Resources 27 3.10%
Laboratory Technician 153 17.6%
Manager 51 5.86%
Manufacturing Director 87 10.0%
Research Director 51 5.86%
Research Scientist 172 19.8%
Sales Executive 200 23.0%
Sales Representative 53 6.09%

Sales Representatives have the highest attrition rate by far. Managers and Research Directors have the highest pay and low attrition rates.

Marital Status

MaritalStatus Count Proportion
Divorced 191 22.0%
Married 410 47.1%
Single 269 30.9%

Single employees have the highest attrition and below average pay.

Num Companies Worked

I would like to think this is the number of companies worked before this one because the minimum is zero, so those in that category would have only worked at this company. Otherwise I don’t know how zero makes sense unless we are counting employees who have been hired but not started.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.728   4.000   9.000 

It seems like those that attrite have worked at fewer companies?

Over Time

OverTime Count Proportion
No 618 71.0%
Yes 252 29.0%

Overtime eligible employees have much higher attrition rates and slightly lower monthly income.

Stock Option Level

StockOptionLevel Count Proportion
0 379 43.6%
1 355 40.8%
2 81 9.31%
3 55 6.32%

Most employees fall in Stock Option level 0 or 1, with much fewer in 2 & 3. Stock option levels 0 & 3 have the highest attrition, with 1 & 2 being much lower. Perhaps 1 & 2 have more long term incentives to keep employees at the firm. 1 & 2 also have higher median / average pay.

Daily, Hourly, and Monthly Rates

I am not seeing any meaningful relationship to attrition or Monthly Income for Daily Rate

I am not seeing any meaningful relationship to attrition or Monthly Income for Hourly Rate, and Hourly Rate doesn’t seem related to Daily Rate. I would normally assoicate Hourly/Daily rates with income but in this case either this is not true or somehow doesn’t translate to monthly income. I will probably ignore these.

Other Variables

  • ID
    • EmployeeNumber
    • EmployeeCount
    • Over18
    • StandardHours

These variables are not particularly useful for analysis because they either are individual IDs or do not vary. Every row is one employee, every employee is over 18, and every employee has standard hours of 80. 18 is a standard working age, 80 hours would seem high if it was a weekly number but don’t have any information about it other than it is constant.

Additional Analysis

T-Tests

Variable P-Value
Age 0.00005
Monthly Income 0.00000
Distance From Home 0.01641
Companies Worked For 0.09788
Salary Hike 0.66923
Total Work Years 0.00000
Years at Company 0.00026
Years in Role 0.00000
Years Since Promotion 0.89832
Years With Manager 0.00001

Satisfaction

I want to look at cross sections of the satisfaction variables to see what happens when people are dissatisfied in more than one area.

TotalSatisfaction Count Proportion
1 24 2.76%
2 296 34.0%
3 455 52.3%
4 95 10.9%

So the most dissatisfied people leave at much higher rates and the most satisfied rarely leave. That said the majority fit into the average satifaction bucket (“3”) so I am not sure how much performance it will add to the attrition model.

Performance and Salary

There are only two ratings, I wanted confirmation that 4 is a better rating, and this plot shows those that have a rating of 4 get larger salary increases.

Variable Importance

Trying to get an idea of what variables might be important for modeling

                                          No         Yes
Age                               6.52185425  6.52185425
BusinessTravelTravel_Frequently  -0.26541400 -0.26541400
BusinessTravelTravel_Rarely       0.46717934  0.46717934
DepartmentResearch & Development  1.19310689  1.19310689
DepartmentSales                   2.55332552  2.55332552
DistanceFromHome                  1.98471359  1.98471359
Education2                       -1.06722009 -1.06722009
Education3                       -1.40306383 -1.40306383
Education4                       -0.01705006 -0.01705006
Education5                       -1.88722037 -1.88722037
EducationFieldLife Sciences       0.30416695  0.30416695
EducationFieldMarketing           4.71069204  4.71069204
EducationFieldMedical             1.04794352  1.04794352
EducationFieldOther              -2.80707947 -2.80707947
EducationFieldTechnical Degree   -0.56185017 -0.56185017
EnvironmentSatisfaction2         -0.88218182 -0.88218182
EnvironmentSatisfaction3         -0.51665115 -0.51665115
EnvironmentSatisfaction4          2.27915035  2.27915035
GenderMale                       -0.82319377 -0.82319377
JobInvolvement2                  -2.00022141 -2.00022141
JobInvolvement3                   0.85106244  0.85106244
JobInvolvement4                  -0.10973691 -0.10973691
JobLevel2                         1.44956652  1.44956652
JobLevel3                        -0.40884542 -0.40884542
JobLevel4                        -0.23505241 -0.23505241
JobLevel5                        -1.19875153 -1.19875153
JobRoleHuman Resources           -0.29187424 -0.29187424
JobRoleLaboratory Technician      1.59762428  1.59762428
JobRoleManager                   -0.29298418 -0.29298418
JobRoleManufacturing Director     0.75891419  0.75891419
JobRoleResearch Director          1.65859574  1.65859574
JobRoleResearch Scientist         2.99278442  2.99278442
JobRoleSales Executive           -0.85877930 -0.85877930
JobRoleSales Representative       1.90627863  1.90627863
JobSatisfaction2                 -0.50095252 -0.50095252
JobSatisfaction3                 -0.04278634 -0.04278634
JobSatisfaction4                  2.64712819  2.64712819
 [ reached 'max' / getOption("max.print") -- omitted 22 rows ]

rf variable importance

  only 20 most important variables shown (out of 59)

                              Overall
JobLevel3                     100.000
JobLevel2                      86.050
TotalWorkingYears              44.842
JobRoleResearch Director       35.647
JobLevel4                      31.699
JobLevel5                      24.098
JobRoleManager                 22.089
JobRoleLaboratory Technician   22.035
JobRoleSales Executive         18.454
YearsAtCompany                 15.715
YearsInCurrentRole             14.394
JobRoleResearch Scientist      13.504
JobRoleManufacturing Director  12.843
Age                            12.065
JobRoleSales Representative    10.289
DepartmentSales                10.162
BusinessTravelTravel_Rarely     9.984
Education5                      9.708
NumCompaniesWorked              9.325
YearsWithCurrManager            9.218
rf variable importance

  only 20 most important variables shown (out of 59)

                              Overall
JobLevel2                     100.000
JobLevel3                      78.826
TotalWorkingYears              59.557
JobRoleResearch Director       27.147
JobLevel4                      25.509
JobRoleLaboratory Technician   25.354
JobRoleSales Executive         19.897
JobRoleManager                 17.915
JobLevel5                      17.901
YearsAtCompany                 16.193
JobRoleResearch Scientist      15.634
Age                            13.245
YearsInCurrentRole             13.118
JobRoleManufacturing Director  12.688
YearsWithCurrManager           12.278
YearsSinceLastPromotion        12.185
DepartmentSales                 9.979
JobRoleSales Representative     9.898
JobSatisfaction2                9.486
BusinessTravelTravel_Rarely     9.188

Linear Model

# A tibble: 60 x 5
   term                               estimate std.error statistic  p.value
   <chr>                                 <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                         3026.      827.       3.66  0.000272
 2 Age                                   -4.13      6.05    -0.683 0.495   
 3 AttritionYes                         -54.4     124.      -0.438 0.661   
 4 BusinessTravelTravel_Frequently       50.2     149.       0.338 0.736   
 5 BusinessTravelTravel_Rarely          321.      126.       2.55  0.0110  
 6 `DepartmentResearch & Development`   344.      574.       0.600 0.549   
 7 DepartmentSales                     -190.      581.      -0.328 0.743   
 8 DistanceFromHome                      -3.88      4.78    -0.813 0.416   
 9 Education2                            17.9     142.       0.126 0.900   
10 Education3                           -20.2     131.      -0.154 0.878   
# ... with 50 more rows
[1] 1161.022
# A tibble: 60 x 5
   term                              estimate std.error statistic   p.value
   <chr>                                <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                        8.01e+0   0.178     45.0    3.49e-200
 2 Age                                3.90e-5   0.00130    0.0299 9.76e-  1
 3 AttritionYes                      -5.93e-2   0.0267    -2.22   2.70e-  2
 4 BusinessTravelTravel_Frequently    3.06e-2   0.0320     0.955  3.40e-  1
 5 BusinessTravelTravel_Rarely        7.39e-2   0.0271     2.73   6.56e-  3
 6 `DepartmentResearch & Developme~   5.13e-2   0.124      0.415  6.78e-  1
 7 DepartmentSales                   -1.05e-2   0.125     -0.0842 9.33e-  1
 8 DistanceFromHome                  -4.83e-4   0.00103   -0.470  6.39e-  1
 9 Education2                        -3.74e-3   0.0306    -0.122  9.03e-  1
10 Education3                        -2.18e-2   0.0282    -0.771  4.41e-  1
# ... with 50 more rows
[1] 1206.878
# A tibble: 14 x 5
   term                            estimate std.error statistic p.value
   <chr>                              <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)                      3471.      197.     17.6    0      
 2 `JobRoleHuman Resources`         -926.      298.     -3.11   0.00197
 3 `JobRoleLaboratory Technician`  -1134.      190.     -5.96   0      
 4 JobRoleManager                   3184.      256.     12.5    0      
 5 `JobRoleManufacturing Director`   130.      172.      0.754  0.451  
 6 `JobRoleResearch Director`       3368.      224.     15.0    0      
 7 `JobRoleResearch Scientist`      -900.      193.     -4.66   0      
 8 `JobRoleSales Executive`           -7.16    148.     -0.0482 0.962  
 9 `JobRoleSales Representative`   -1122.      236.     -4.75   0      
10 JobLevel2                        1764.      151.     11.7    0      
11 JobLevel3                        5070.      207.     24.5    0      
12 JobLevel4                        8458.      307.     27.6    0      
13 JobLevel5                       11185.      362.     30.9    0      
14 TotalWorkingYears                  48.2       8.71    5.54   0      
[1] 1105.254
# A tibble: 14 x 5
   term                            estimate std.error statistic  p.value
   <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                      8.02      0.0424   189.     0       
 2 `JobRoleHuman Resources`        -0.176     0.0642    -2.75   0.00617 
 3 `JobRoleLaboratory Technician`  -0.227     0.0410    -5.53   0       
 4 JobRoleManager                   0.260     0.0550     4.73   0       
 5 `JobRoleManufacturing Director`  0.0144    0.0370     0.390  0.697   
 6 `JobRoleResearch Director`       0.289     0.0483     5.98   0       
 7 `JobRoleResearch Scientist`     -0.167     0.0416    -4.01   0.000070
 8 `JobRoleSales Executive`        -0.00218   0.0319    -0.0682 0.946   
 9 `JobRoleSales Representative`   -0.267     0.0509    -5.25   0       
10 JobLevel2                        0.504     0.0325    15.5    0       
11 JobLevel3                        0.956     0.0445    21.5    0       
12 JobLevel4                        1.16      0.0660    17.5    0       
13 JobLevel5                        1.29      0.0779    16.6    0       
14 TotalWorkingYears                0.0106    0.00187    5.64   0       
[1] 1149.049
# A tibble: 25 x 5
   term                            estimate std.error statistic p.value
   <chr>                              <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)                       2891.       612.     4.72    0    
 2 `JobRoleHuman Resources`          -590.       620.    -0.952   0.341
 3 `JobRoleLaboratory Technician`   -1126.       189.    -5.95    0    
 4 JobRoleManager                    3414.       299.    11.4     0    
 5 `JobRoleManufacturing Director`     78.1      171.     0.456   0.648
 6 `JobRoleResearch Director`        3342.       226.    14.8     0    
 7 `JobRoleResearch Scientist`       -890.       192.    -4.63    0    
 8 `JobRoleSales Executive`           528.       350.     1.51    0.132
 9 `JobRoleSales Representative`     -547.       395.    -1.38    0.167
10 JobLevel2                         1795.       153.    11.7     0    
# ... with 15 more rows
[1] 1122.661
term estimate std.error statistic p.value
(Intercept) 3192.53689 220.69123 14.46608 0.00000
JobRoleHuman Resources -930.12879 295.44697 -3.14821 0.00171
JobRoleLaboratory Technician -1132.61843 188.91911 -5.99526 0.00000
JobRoleManager 3160.54577 253.88244 12.44886 0.00000
JobRoleManufacturing Director 90.16329 170.77172 0.52798 0.59769
JobRoleResearch Director 3343.20685 222.71514 15.01113 0.00000
JobRoleResearch Scientist -891.30499 191.48330 -4.65474 0.00000
JobRoleSales Executive -25.79427 147.21420 -0.17522 0.86096
JobRoleSales Representative -1100.13921 234.98589 -4.68172 0.00000
JobLevel2 1802.42014 149.93289 12.02151 0.00000
JobLevel3 5113.20845 205.50749 24.88089 0.00000
JobLevel4 8498.56964 304.55243 27.90511 0.00000
JobLevel5 11162.93559 359.30883 31.06780 0.00000
TotalWorkingYears 48.31583 8.63735 5.59382 0.00000
BusinessTravelTravel_Frequently 71.35281 143.71348 0.49649 0.61971
BusinessTravelTravel_Rarely 352.26220 122.05269 2.88615 0.00402
Variable 2.5 % 97.5 %
(Intercept) 2759.22104 3625.85273
JobRoleHuman Resources -1510.22368 -350.03390
JobRoleLaboratory Technician -1503.55136 -761.68549
JobRoleManager 2662.06068 3659.03086
JobRoleManufacturing Director -245.13818 425.46476
JobRoleResearch Director 2905.91715 3780.49654
JobRoleResearch Scientist -1267.27258 -515.33740
JobRoleSales Executive -314.84177 263.25322
JobRoleSales Representative -1561.52189 -638.75653
JobLevel2 1508.03465 2096.80564
JobLevel3 4709.70508 5516.71181
JobLevel4 7900.59664 9096.54263
JobLevel5 10457.45121 11868.41996
TotalWorkingYears 31.35683 65.27483
BusinessTravelTravel_Frequently -210.82119 353.52681
BusinessTravelTravel_Rarely 112.61803 591.90636
[1] 1120.712

Random Forest (Monthly Income)

rf variable importance

  only 20 most important variables shown (out of 24)

                                Overall
JobLevel2                       100.000
JobLevel3                        95.225
TotalWorkingYears                51.181
JobRoleResearch Director         39.175
JobLevel4                        29.971
JobRoleSales Executive           24.835
YearsAtCompany                   23.536
JobRoleManager                   22.585
JobRoleLaboratory Technician     21.594
YearsInCurrentRole               20.869
BusinessTravelTravel_Rarely      17.844
JobLevel5                        17.585
JobRoleManufacturing Director    15.224
JobRoleSales Representative      10.796
JobRoleResearch Scientist        10.669
Education2                       10.218
Age                               9.877
DepartmentSales                   9.393
BusinessTravelTravel_Frequently   7.707
Education5                        5.336
[1] 1088.056
rf variable importance

                                Overall
JobLevel2                       100.000
JobLevel3                        96.368
TotalWorkingYears                41.868
JobRoleResearch Director         35.650
JobLevel4                        23.701
JobRoleManufacturing Director    22.953
JobLevel5                        20.545
JobRoleLaboratory Technician     19.717
BusinessTravelTravel_Rarely      18.106
JobRoleManager                   15.003
JobRoleSales Executive           13.188
JobRoleResearch Scientist        10.702
JobRoleSales Representative       9.541
JobRoleHuman Resources            7.933
BusinessTravelTravel_Frequently   0.000
[1] 1092.35

Naive Bayes

Lets quick try a model with everything in it.

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  86   6
       Yes 59  21
                                          
               Accuracy : 0.6221          
                 95% CI : (0.5451, 0.6948)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2062          
                                          
 Mcnemar's Test P-Value : 1.12e-10        
                                          
            Sensitivity : 0.7778          
            Specificity : 0.5931          
         Pos Pred Value : 0.2625          
         Neg Pred Value : 0.9348          
             Prevalence : 0.1570          
         Detection Rate : 0.1221          
   Detection Prevalence : 0.4651          
      Balanced Accuracy : 0.6854          
                                          
       'Positive' Class : Yes             
                                          
[1] 0.7257384

Not too bad. Specificity is just under the desired threshold.

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  126  14
       Yes  19  13
                                          
               Accuracy : 0.8081          
                 95% CI : (0.7412, 0.8641)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.9107          
                                          
                  Kappa : 0.3259          
                                          
 Mcnemar's Test P-Value : 0.4862          
                                          
            Sensitivity : 0.48148         
            Specificity : 0.86897         
         Pos Pred Value : 0.40625         
         Neg Pred Value : 0.90000         
             Prevalence : 0.15698         
         Detection Rate : 0.07558         
   Detection Prevalence : 0.18605         
      Balanced Accuracy : 0.67522         
                                          
       'Positive' Class : Yes             
                                          
[1] 0.8842105
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  101  10
       Yes  44  17
                                         
               Accuracy : 0.686          
                 95% CI : (0.611, 0.7545)
    No Information Rate : 0.843          
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.2157         
                                         
 Mcnemar's Test P-Value : 7.098e-06      
                                         
            Sensitivity : 0.62963        
            Specificity : 0.69655        
         Pos Pred Value : 0.27869        
         Neg Pred Value : 0.90991        
             Prevalence : 0.15698        
         Detection Rate : 0.09884        
   Detection Prevalence : 0.35465        
      Balanced Accuracy : 0.66309        
                                         
       'Positive' Class : Yes            
                                         
[1] 0.7890625
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  107  11
       Yes  38  16
                                          
               Accuracy : 0.7151          
                 95% CI : (0.6414, 0.7812)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.9999930       
                                          
                  Kappa : 0.2349          
                                          
 Mcnemar's Test P-Value : 0.0002038       
                                          
            Sensitivity : 0.59259         
            Specificity : 0.73793         
         Pos Pred Value : 0.29630         
         Neg Pred Value : 0.90678         
             Prevalence : 0.15698         
         Detection Rate : 0.09302         
   Detection Prevalence : 0.31395         
      Balanced Accuracy : 0.66526         
                                          
       'Positive' Class : Yes             
                                          
[1] 0.8136882
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  106  12
       Yes  39  15
                                          
               Accuracy : 0.7035          
                 95% CI : (0.6292, 0.7706)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.9999987       
                                          
                  Kappa : 0.2037          
                                          
 Mcnemar's Test P-Value : 0.0002719       
                                          
            Sensitivity : 0.55556         
            Specificity : 0.73103         
         Pos Pred Value : 0.27778         
         Neg Pred Value : 0.89831         
             Prevalence : 0.15698         
         Detection Rate : 0.08721         
   Detection Prevalence : 0.31395         
      Balanced Accuracy : 0.64330         
                                          
       'Positive' Class : Yes             
                                          
[1] 0.8060837
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  100   8
       Yes  45  19
                                          
               Accuracy : 0.6919          
                 95% CI : (0.6171, 0.7599)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2525          
                                          
 Mcnemar's Test P-Value : 7.615e-07       
                                          
            Sensitivity : 0.7037          
            Specificity : 0.6897          
         Pos Pred Value : 0.2969          
         Neg Pred Value : 0.9259          
             Prevalence : 0.1570          
         Detection Rate : 0.1105          
   Detection Prevalence : 0.3721          
      Balanced Accuracy : 0.6967          
                                          
       'Positive' Class : Yes             
                                          
[1] 0.7905138

Random Forest (Attrition)

rf variable importance

  only 20 most important variables shown (out of 39)

                                 Importance
OverTimeYes                          100.00
MonthlyIncome                         65.18
StockOptionLevel1                     54.04
Age                                   52.08
YearsAtCompany                        46.75
TotalWorkingYears                     46.59
YearsWithCurrManager                  40.79
MaritalStatusSingle                   38.38
EnvironmentSatisfaction4              33.13
JobRoleSales Representative           31.49
YearsInCurrentRole                    31.19
DepartmentSales                       30.22
NumCompaniesWorked                    29.98
JobRoleResearch Director              29.51
PerformanceRating4                    27.53
DepartmentResearch & Development      26.43
EducationFieldMarketing               25.64
JobSatisfaction4                      25.04
JobRoleResearch Scientist             22.25
EducationFieldMedical                 19.98
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  136  21
       Yes   9   6
                                          
               Accuracy : 0.8256          
                 95% CI : (0.7605, 0.8791)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.77179         
                                          
                  Kappa : 0.1955          
                                          
 Mcnemar's Test P-Value : 0.04461         
                                          
            Sensitivity : 0.9379          
            Specificity : 0.2222          
         Pos Pred Value : 0.8662          
         Neg Pred Value : 0.4000          
             Prevalence : 0.8430          
         Detection Rate : 0.7907          
   Detection Prevalence : 0.9128          
      Balanced Accuracy : 0.5801          
                                          
       'Positive' Class : No              
                                          
[1] 0.9006623
rf variable importance

  only 20 most important variables shown (out of 37)

                              Importance
OverTimeYes                       100.00
MonthlyIncome                      74.86
TotalWorkingYears                  59.37
MaritalStatusSingle                49.32
YearsInCurrentRole                 44.86
JobSatisfaction4                   41.48
JobRoleResearch Scientist          41.18
JobRoleSales Representative        39.58
Age                                38.51
StockOptionLevel1                  36.72
JobSatisfaction3                   34.98
EducationFieldMarketing            34.36
JobRoleSales Executive             33.07
RelationshipSatisfaction2          31.19
DistanceFromHome                   30.68
NumCompaniesWorked                 29.32
MaritalStatusMarried               28.47
JobRoleManufacturing Director      26.30
RelationshipSatisfaction3          25.63
EnvironmentSatisfaction3           24.84
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  137  22
       Yes   8   5
                                          
               Accuracy : 0.8256          
                 95% CI : (0.7605, 0.8791)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.77179         
                                          
                  Kappa : 0.1648          
                                          
 Mcnemar's Test P-Value : 0.01762         
                                          
            Sensitivity : 0.9448          
            Specificity : 0.1852          
         Pos Pred Value : 0.8616          
         Neg Pred Value : 0.3846          
             Prevalence : 0.8430          
         Detection Rate : 0.7965          
   Detection Prevalence : 0.9244          
      Balanced Accuracy : 0.5650          
                                          
       'Positive' Class : No              
                                          
[1] 0.9013158
rf variable importance

                       Importance
OverTimeYes               100.000
StockOptionLevel1          59.833
MonthlyIncome              54.840
StockOptionLevel2          29.863
JobLevel2                  28.596
YearsInCurrentRole         27.278
MaritalStatusSingle        23.755
WorkingYearsGroup10-15     21.730
MaritalStatusMarried       17.500
JobLevel3                  14.601
JobLevel4                  13.983
StockOptionLevel3           9.976
WorkingYearsGroup5-10       8.232
JobInvolvement3             7.523
JobInvolvement4             7.089
JobInvolvement2             6.192
WorkingYearsGroup15-20      3.295
JobLevel5                   1.091
WorkingYearsGroup20-40      0.000

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  137  21
       Yes   8   6
                                          
               Accuracy : 0.8314          
                 95% CI : (0.7669, 0.8841)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.70582         
                                          
                  Kappa : 0.2078          
                                          
 Mcnemar's Test P-Value : 0.02586         
                                          
            Sensitivity : 0.9448          
            Specificity : 0.2222          
         Pos Pred Value : 0.8671          
         Neg Pred Value : 0.4286          
             Prevalence : 0.8430          
         Detection Rate : 0.7965          
   Detection Prevalence : 0.9186          
      Balanced Accuracy : 0.5835          
                                          
       'Positive' Class : No              
                                          
[1] 0.9042904

Score final datasets

William Arnost

2019-08-18